feat(bench): live observe→steer join (real worker + real observer) #195
…r observer The merged observe-steer-workspace-loop.mts proves the join through a mock observer (transport:'mock', canned findings) and a canned worker — the grammar talking to itself, which docs/research/loop-facade-postmortem.md warns against. This closes the same join on LIVE endpoints: a real cloud opencode worker (openSandboxRun) produces a real event trace, observe() reads it with a real router LLM, and the finding's recommended_action is injected as the next round's steer. The join ran live end-to-end for 3 rounds. Re-runs are currently blocked at provisioning by a sandbox egress regression (router.tangle.tools CONNECT-403 from inside the box; only that host — every other tangle host + provider egress passes), tracked as ops-board #984. So this proves the live JOIN; efficacy (does the steer improve behavior at equal budget) is gated on that unblock.
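The join described above can be sketched as a small loop. This is a minimal illustration, not the code in `cloud-loop.mts`: the `Finding`, `Worker`, and `Observer` shapes and the `runLoop` name are hypothetical, and only the `recommended_action` field comes from the description. The real `openSandboxRun` and `observe()` signatures are not shown in this PR.

```typescript
// Hypothetical shapes; only `recommended_action` is named in the PR text.
interface Finding {
  recommended_action: string;
}
// A worker takes the current steers and returns an event trace.
type Worker = (steers: readonly string[]) => Promise<string[]>;
// An observer reads a trace and returns zero or more findings.
type Observer = (trace: readonly string[]) => Promise<Finding[]>;

// Runs `rounds` rounds and returns the steers that were injected into each round.
async function runLoop(
  worker: Worker,
  observer: Observer,
  rounds: number,
): Promise<string[][]> {
  const injected: string[][] = [];
  let steers: string[] = [];
  for (let round = 0; round < rounds; round++) {
    injected.push(steers);
    const trace = await worker(steers);
    const findings = await observer(trace);
    // Each finding's recommended_action becomes a steer for the next round.
    steers = findings.map((f) => f.recommended_action);
  }
  return injected;
}
```

With stubbed ends this only shows the data flow; the point of the PR is that both `worker` and `observer` are live.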
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 76 | 89 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 76 | 89 | 76 |
| Security | 76 | 89 | 76 |
| Testing | 76 | 89 | 76 |
| Architecture | 76 | 89 | 76 |
Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.
🟠 MEDIUM observer call has no timeout/abort signal — bench/src/cloud-loop.mts
L117-120:
`await observe(...)` is called without an AbortSignal. The per-round 240s timeout on L88 (`setTimeout(() => controller.abort(), 240_000)`) only aborts `openSandboxRun`, and the timer is cleared on L109 before observe runs. If the router LLM inside observe hangs (network stall, model degradation), the script hangs indefinitely with no timeout. The `controller` is still in scope: pass `signal: controller.signal` into the observe options to give it a hard deadline. The observe() function forwards opts.signal to `chat.chat()` (observe.ts:150), so the plumbing already exists.
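A minimal sketch of the suggested fix, under stated assumptions. `fakeObserve` is a hypothetical stand-in for `observe()` that only models "resolves after some time unless the signal aborts"; `observeWithDeadline` is likewise an illustrative name. One detail worth noting: because the finding says the timer is cleared before observe runs, passing the signal alone would not enforce a deadline, so the sketch keeps the timer armed until the observer call settles.

```typescript
// Hypothetical stand-in for observe(): resolves after `ms` unless aborted first.
function fakeObserve(ms: number, opts: { signal?: AbortSignal }): Promise<string> {
  return new Promise((resolve, reject) => {
    if (opts.signal?.aborted) {
      reject(new Error("observe aborted"));
      return;
    }
    const timer = setTimeout(() => resolve("finding"), ms);
    opts.signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("observe aborted"));
      },
      { once: true },
    );
  });
}

// Keeps the abort timer armed for the whole observer call and clears it afterwards.
async function observeWithDeadline(workMs: number, deadlineMs: number): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), deadlineMs);
  try {
    return await fakeObserve(workMs, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // always release the timer, on success or abort
  }
}
```

In the real script the same controller could cover both the worker and the observer, or each leg could get its own deadline; which is preferable depends on how the 240s budget is meant to be split.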
🟡 LOW AbortSignal not propagated to observe() call — bench/src/cloud-loop.mts
Line 111:
`observe(...)` accepts `opts.signal` (ObserveOptions.signal exists per src/runtime/observe.ts:46) but the cloud-loop does not pass `controller.signal`. If the observe LLM call hangs, the round cannot be cancelled. The overall loop is bounded by ROUNDS, so this is not a hang risk, but it means a timed-out round's observer call continues burning tokens after the worker was already aborted. Pass `{ chat, model, signal: controller.signal }` to observe.
🟡 LOW Final status message checks current steers, not cumulative history — bench/src/cloud-loop.mts
Line 125:
`steers.length ? 'steered ' : ''` reflects only whether the LAST round had steers (steers is cleared and refilled each round at line 117-118). If the observer returned findings in round 2 but not round 3, the final message says 'rounds' without 'steered', which is misleading. Cosmetic only. Track a boolean `everSteered` if accurate reporting matters.
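A small sketch of the `everSteered` idea. The `summarize` helper and its input shape (one list of steers per round) are hypothetical; the actual status message format in `cloud-loop.mts` is not shown here.

```typescript
// Hypothetical helper: steersPerRound[i] holds the steers produced in round i.
function summarize(steersPerRound: readonly (readonly string[])[]): string {
  let everSteered = false;
  for (const steers of steersPerRound) {
    // Latch to true once any round produced a steer; never reset.
    if (steers.length > 0) everSteered = true;
  }
  return `${steersPerRound.length} ${everSteered ? "steered " : ""}rounds`;
}
```

The flag is cumulative, so a quiet final round no longer hides steering that happened earlier.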
🟡 LOW no test coverage — bench/src/cloud-loop.mts
No tests exist for this file. The vitest config (vitest.config.ts:5) excludes `bench/**` entirely, so even if tests were written they wouldn't run in CI. This is a bench/tooling script by design, but the verify() and tools() functions are pure and testable. Consider extracting them to a testable location or adding an integration check gated on env vars.
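One way to read "an integration check gated on env vars" is a guard that skips the live run unless the required credentials are present. The sketch below is an assumption about shape only: the variable names `ROUTER_API_KEY` and `SANDBOX_API_KEY` are hypothetical placeholders, not names taken from this repository.

```typescript
// Returns the names of required variables that are unset or empty.
function missingEnv(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => !env[name]);
}

// Hypothetical variable names, for illustration only.
const REQUIRED_FOR_LIVE = ["ROUTER_API_KEY", "SANDBOX_API_KEY"] as const;

// True only when every credential needed for a live round is available.
function shouldRunLive(env: Record<string, string | undefined>): boolean {
  return missingEnv(REQUIRED_FOR_LIVE, env).length === 0;
}
```

A test runner could call `shouldRunLive(process.env)` and skip the live case when it returns false, so CI without credentials stays green.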
🟡 LOW observer failure unhandled — crashes the loop — bench/src/cloud-loop.mts
L117-120:
`observe()` is called outside the try/catch that protects the sandbox run (L91-108). If the router LLM returns a malformed JSON response, or the network errors, `observe()` throws → bypasses the catch on L105 → propagates to `main().catch()` on L135 → logs and exits 1. The per-round error handling pattern (log, continue) is broken for the observer leg. Wrap in try/catch and `continue` on failure so a transient router blip doesn't kill the whole bench run.
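The log-and-continue pattern for the observer leg might look like the sketch below. `runRounds` and `ObserveFn` are hypothetical names, and the observer is passed in as a plain function so the example stays self-contained; the real call would be `observe()` with its actual options.

```typescript
// Hypothetical observer signature: takes the round number, returns steers.
type ObserveFn = (round: number) => Promise<string[]>;

// Runs every round; an observer failure is logged and that round is skipped.
async function runRounds(
  rounds: number,
  observeRound: ObserveFn,
): Promise<{ completed: number[]; failed: number[] }> {
  const completed: number[] = [];
  const failed: number[] = [];
  for (let round = 1; round <= rounds; round++) {
    try {
      await observeRound(round);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      console.error(`round ${round}: observer failed: ${reason}`);
      failed.push(round);
      continue; // a transient failure should not end the whole run
    }
    completed.push(round);
  }
  return { completed, failed };
}
```

Recording the failed rounds keeps the final report honest about which rounds actually produced a finding.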
🟡 LOW unnecessary as never type assertion — bench/src/cloud-loop.mts
L96:
`fromEvents: (e) => answerOutput.parse(e as never)`. The `e` parameter is typed `SandboxEvent[]` from `Deliverable<'events'>`. `answerOutput.parse` accepts `ReadonlyArray<unknown>` per `OutputAdapter<string>` (experiment.ts:45). `SandboxEvent[]` is assignable to `ReadonlyArray<unknown>` without any cast. The `as never` is dead code, so remove it. If the cast was suppressing a real type error, the root cause should be fixed instead of papered over.
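The assignability claim can be checked in isolation. In the sketch below, `SandboxEvent` and `OutputAdapter` are simplified hypothetical stand-ins for the real types (their actual fields are not shown in this review); the only property that matters is that `parse` takes `ReadonlyArray<unknown>`.

```typescript
// Hypothetical, simplified stand-ins for the real types.
interface SandboxEvent {
  type: string;
  text?: string;
}
interface OutputAdapter<T> {
  parse(events: ReadonlyArray<unknown>): T;
}

// Extracts a string `text` field from an unknown value, or "" if absent.
function textOf(e: unknown): string {
  if (typeof e === "object" && e !== null && "text" in e) {
    const text = (e as { text?: unknown }).text;
    return typeof text === "string" ? text : "";
  }
  return "";
}

const answerOutput: OutputAdapter<string> = {
  parse: (events) => events.map(textOf).join(""),
};

// Compiles without `as never`: SandboxEvent[] is assignable to ReadonlyArray<unknown>.
const fromEvents = (e: SandboxEvent[]): string => answerOutput.parse(e);
```

If the real code fails to compile once the cast is removed, that would point to a type mismatch elsewhere that the cast was hiding.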
tangletools · 2026-06-08T14:30:03Z · trace
tangletools left a comment
✅ Approved — 6 non-blocking findings — 302af97e
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-08T14:30:03Z · immutable trace
What
Adds `bench/src/cloud-loop.mts`, the observe→steer loop closed on live endpoints:
Why
PR #194 shipped `observe-steer-workspace-loop.mts`, which proves the join through a mock observer (transport:'mock', canned findings) feeding a canned worker: "the grammar talking to itself," the exact pattern `docs/research/loop-facade-postmortem.md` (also in #194) warns against. This proves the same join with both ends real.
The two are complementary surfaces, not duplicates:
- `observe-steer-workspace-loop.mts` exercises the Scope/Supervisor/coordination-MCP/git-workspace plumbing (mock ends, deterministic, no creds).
- `cloud-loop.mts` (this PR) exercises the live worker + live observer path (`openSandboxRun` + `observe()`).
Status (honest)
The join ran live end-to-end for 3 rounds (real worker → real trace → real router-LLM finding → real steer injection). Re-runs are currently blocked at provisioning by a sandbox egress regression: `router.tangle.tools` returns CONNECT-403 from inside the box (only that host; `id/pangolin/sandbox.tangle.tools` and `api.openai.com` all pass). It worked 2026-06-06 → platform regression, tracked as ops-board #984. So this proves the live join; efficacy (does the steer improve behavior at equal budget) is gated on that unblock.
Follow-up (recommendation, not in this PR)
Once #984 is unblocked, run for efficacy. Separately, consider converting `observe-steer-workspace-loop.mts` into a real CI unit test under `tests/loops/` (it currently runs as a standalone `tsx` demo and asserts nothing), or retiring it now that the live join exists.
Test
Code is byte-identical to the version that ran live for 3 rounds (only the header docstring changed). `bench/**` is outside the root biome scope (consistent with sibling `fleet.mts`/`workspace-loop.mts`); build is verified-by-execution.